Tibet Autonomous Region
Sun-Shine: A Large Language Model for Tibetan Culture
Huang, Cheng, Gao, Fan, Tashi, Nyima, Liu, Yutong, Wang, Xiangxiang, Tsering, Thupten, Ma-bao, Ban, Duojie, Renzeg, Luosang, Gadeng, Dongrub, Rinchen, Tashi, Dorje, Feng, Xiao, Yu, Yongbin
Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and frequent irregularities that contribute to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite this success, current LLMs often fall short of the needs of domain experts such as Tibetans, and the potential of LLMs for Tibetan culture remains under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture and the necessity for higher granularity and richness of knowledge. At the same time, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, lead to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which excels at a range of Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, the first large-scale dataset for Tibetan culture, comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise in Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks such as language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.
TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity
Cao, Xi, Gesang, Quzong, Sun, Yuan, Qun, Nuo, Nyima, Tashi
Language models based on deep neural networks are vulnerable to textual adversarial attacks. While rich-resource languages like English receive focused attention, Tibetan, a cross-border language, is gradually being studied owing to its abundant ancient literature and its strategic linguistic importance. Several Tibetan adversarial text generation methods already exist, but they do not fully consider the textual features of the Tibetan script and overestimate the quality of the adversarial texts they generate. To address this, we propose TSCheater, a novel Tibetan adversarial text generation method that exploits the characteristics of Tibetan encoding and the observation that visually similar syllables have similar semantics. The method also transfers to other abugidas, such as the Devanagari script. We use a self-constructed Tibetan syllable visual similarity database, TSVSDB, to generate substitution candidates and adopt a greedy, score-based mechanism to determine substitution order. We then evaluate the method against eight victim language models. Experimentally, TSCheater outperforms existing methods in attack effectiveness, perturbation magnitude, semantic similarity, visual similarity, and human acceptance. Finally, we construct AdvTS, the first Tibetan adversarial robustness evaluation benchmark, generated by existing methods and proofread by humans.
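The greedy substitution strategy the abstract describes can be sketched in a few lines. This is an illustrative stand-in, not TSCheater's actual code: the visual-similarity database, the victim model's confidence function, and the importance scoring rule here are all toy assumptions.

```python
# Minimal sketch of a greedy visual-similarity substitution attack: score each
# position by the confidence drop when it is masked, then substitute visually
# similar syllables at high-impact positions first. All components are toys.
from typing import Callable, Dict, List

def greedy_attack(
    syllables: List[str],
    similar: Dict[str, List[str]],            # visual-similarity candidates
    confidence: Callable[[List[str]], float]  # victim's confidence in the gold label
) -> List[str]:
    base = confidence(syllables)
    drops = []
    for i, s in enumerate(syllables):
        if s not in similar:
            continue
        masked = syllables[:i] + ["<mask>"] + syllables[i + 1:]
        drops.append((base - confidence(masked), i))
    adv = list(syllables)
    # Substitute greedily, highest-impact positions first, keeping only
    # substitutions that actually lower the victim's confidence.
    for _, i in sorted(drops, reverse=True):
        cands = similar.get(adv[i])
        if not cands:
            continue
        best = min(cands, key=lambda c: confidence(adv[:i] + [c] + adv[i + 1:]))
        cand = adv[:i] + [best] + adv[i + 1:]
        if confidence(cand) < confidence(adv):
            adv = cand
    return adv
```

A real attack would replace the toy confidence function with a fine-tuned victim model and draw candidates from a database like TSVSDB.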
Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script
Cao, Xi, Sun, Yuan, Li, Jiajun, Gesang, Quzong, Qun, Nuo, Nyima, Tashi
DNN-based language models perform excellently on various tasks, but even SOTA LLMs are susceptible to textual adversarial attacks. Adversarial texts play crucial roles in multiple subfields of NLP. However, current research has the following issues. (1) Most textual adversarial attack methods target rich-resourced languages. How do we generate adversarial texts for less-studied languages? (2) Most textual adversarial attack methods are prone to generating invalid or ambiguous adversarial texts. How do we construct high-quality adversarial robustness benchmarks? (3) New language models may be immune to part of previously generated adversarial texts. How do we update adversarial robustness benchmarks? To address the above issues, we introduce HITL-GAT, a system based on a general approach to human-in-the-loop generation of adversarial texts. HITL-GAT contains four stages in one pipeline: victim model construction, adversarial example generation, high-quality benchmark construction, and adversarial robustness evaluation. Additionally, we utilize HITL-GAT to make a case study on Tibetan script which can be a reference for the adversarial research of other less-studied languages.
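The four-stage pipeline described above can be expressed as a skeletal function, with the human-in-the-loop proofreading step modeled as a filter. The stage internals below are placeholder assumptions, not HITL-GAT's implementation.

```python
# Skeleton of a four-stage human-in-the-loop adversarial pipeline:
# 1) build victim model, 2) generate adversarial examples, 3) keep only
# human-accepted examples as the benchmark, 4) evaluate robustness on it.
from typing import Callable, List

def hitl_pipeline(
    train_victim: Callable[[], object],
    generate: Callable[[object], List[str]],
    human_accepts: Callable[[str], bool],
    evaluate: Callable[[object, List[str]], float],
) -> float:
    model = train_victim()                             # stage 1
    raw = generate(model)                              # stage 2
    benchmark = [t for t in raw if human_accepts(t)]   # stage 3 (human filter)
    return evaluate(model, benchmark)                  # stage 4
```

Updating the benchmark for a new model amounts to rerunning the pipeline with a new `train_victim`, which is why the stages are kept as injectable callables in this sketch.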
Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model
Cao, Xi, Qun, Nuo, Gesang, Quzong, Zhu, Yulei, Nyima, Trashi
In social media, neural network models have been applied to hate speech detection, sentiment analysis, and similar tasks, but these models are susceptible to adversarial attacks. In a text classification task, for instance, an attacker carefully introduces perturbations that barely alter the original semantics in order to trick the model into making different predictions. By studying textual adversarial attack methods, the robustness of language models can be evaluated and then improved. Most research in this field focuses on English, with a certain amount on Chinese, but little targets Chinese minority languages. With the rapid development of artificial intelligence and the emergence of Chinese minority language models, textual adversarial attacks become a new challenge for the information processing of these languages. In response, we propose TSTricker, a multi-granularity Tibetan textual adversarial attack method based on masked language models. We use masked language models to generate candidate substitution syllables or words, adopt a scoring mechanism to determine substitution order, and then run the attack against several fine-tuned victim models. The experimental results show that TSTricker reduces the accuracy of the classification models by more than 28.70% and changes their predictions on more than 90.60% of the samples, an evidently higher attack effect than the baseline method.
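The candidate-generation step this abstract builds on — mask one syllable or word, then take the masked language model's top-k fills as substitution candidates — can be sketched as follows. The MLM here is a stub scoring function and the vocabulary is a toy list, not a real Tibetan model.

```python
# Sketch of masked-LM candidate generation: mask the token at `position`,
# rank vocabulary items by the (stubbed) MLM fill score, and return the
# top-k candidates, excluding the original token.
from typing import Callable, List

def mlm_candidates(
    tokens: List[str],
    position: int,
    score_fill: Callable[[List[str], str], float],  # MLM score for a fill token
    vocabulary: List[str],
    k: int = 3,
) -> List[str]:
    masked = tokens[:position] + ["[MASK]"] + tokens[position + 1:]
    ranked = sorted(
        (v for v in vocabulary if v != tokens[position]),
        key=lambda v: score_fill(masked, v),
        reverse=True,
    )
    return ranked[:k]
```

"Multi-granularity" then amounts to running this same step once over syllable-segmented and once over word-segmented input, which this sketch leaves to the caller.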
Research on a Tibetan Tourism Viewpoint Information Generation System Based on LLMs
Qi, Jinhu, Yan, Shuai, Zhang, Wentao, Zhang, Yibo, Liu, Zirui, Wang, Ke
Tibet, located within China, is distinguished by its complex and varied topography, its profound historical heritage, and its unique religious traditions. These same attributes, however, have impeded the development of Tibet's tourism service infrastructure, leaving existing smart tourism services inadequate for the region's visitors. This study examines how informational disparities at tourist sites affect Tibetan tourism and addresses the challenge of establishing evaluation criteria for Large Language Models (LLMs). It introduces the DualGen Bridge AI system, which employs supervised fine-tuning to strengthen model functionality and optimization, and it pioneers a multi-structured framework for assessing generated results. Empirical validation confirms the efficacy of this framework. The study also applies supervised fine-tuning within the proprietary DualGen Bridge AI to refine the generation of tourist site information. The findings offer insights for optimizing system performance and support the application of LLM technology in Tibet's tourism services and beyond, with the potential to advance the smart tourism industry through tailored information generation.
First Mapping the Canopy Height of Primeval Forests in the Tallest Tree Area of Asia
Fan, Guangpeng, Yan, Fei, Zeng, Xiangquan, Xu, Qingtao, Wang, Ruoyoulan, Zhang, Binghong, Zhou, Jialing, Nan, Liangliang, Wang, Jinhu, Zhang, Zhiwei, Wang, Jia
We have developed the world's first canopy height map of the distribution area of world-level giant trees. This map is crucial for discovering more individual and community-level world-class giant trees, and for quantifying the effectiveness of biodiversity conservation measures in the Yarlung Tsangpo Grand Canyon (YTGC) National Nature Reserve. We propose a method to map the canopy height of the primeval forest within the giant-tree distribution area using deep learning driven by spaceborne LiDAR fused with satellite imagery (Global Ecosystem Dynamics Investigation (GEDI), ICESat-2, and Sentinel-2). We customized PRFXception, a CNN architecture with pyramid receptive fields and depthwise-separable convolutions, to infer canopy height at the GEDI and ICESat-2 footprint level from Sentinel-2 optical imagery at 10-meter spatial resolution. We conducted a field survey of 227 permanent plots using stratified sampling and measured several giant trees with UAV-LS. The predicted canopy height was compared with ICESat-2 and GEDI validation data (RMSE = 7.56 m, MAE = 6.07 m, ME = -0.98 m, R^2 = 0.58), UAV-LS point clouds (RMSE = 5.75 m, MAE = 3.72 m, ME = 0.82 m, R^2 = 0.65), and ground survey data (RMSE = 6.75 m, MAE = 5.56 m, ME = 2.14 m, R^2 = 0.60). We mapped the potential distribution of world-level giant trees and discovered two previously undetected giant-tree communities with an 89% probability of containing trees 80-100 m tall, potentially taller than Asia's tallest known tree. This paper provides scientific evidence confirming southeastern Tibet and northwestern Yunnan as the fourth global distribution center of world-level giant trees, and it supports including the YTGC giant-tree distribution area within the scope of China's national park conservation.
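The abstract reports four validation metrics (RMSE, MAE, ME, R^2) against three reference datasets. For readers unfamiliar with the combination, a minimal stdlib computation of those metrics is sketched below; the prediction and observation values are toy data, not the paper's.

```python
# Standard height-validation metrics: root-mean-square error, mean absolute
# error, mean error (signed bias), and the coefficient of determination R^2.
import math
from typing import List, Tuple

def height_metrics(pred: List[float], obs: List[float]) -> Tuple[float, float, float, float]:
    n = len(obs)
    errs = [p - o for p, o in zip(pred, obs)]
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    mae = sum(abs(e) for e in errs) / n
    me = sum(errs) / n                       # signed bias: >0 overestimates
    mean_obs = sum(obs) / n
    ss_res = sum(e * e for e in errs)
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, me, r2
```

Note that R^2 is dimensionless, while RMSE, MAE, and ME carry the units of the heights (meters here).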
PEFTT: Parameter-Efficient Fine-Tuning for low-resource Tibetan pre-trained language models
Zhou, Mingjun, Daiqing, Zhuoma, Qun, Nuo, Nyima, Tashi
In this era of large language models (LLMs), training models from scratch has become unimaginable for regular users and institutions, and efficient fine-tuning of these models for high-resource languages is an undeniable trend that is gradually gaining popularity. However, there has been very little such exploration for low-resource languages such as Tibetan. Research in Tibetan NLP is inherently scarce and limited, and efficient fine-tuning strategies for Tibetan pre-trained language models (PLMs) have seen minimal exploration. While no large language model for Tibetan yet exists because of its low-resource nature, that day will undoubtedly arrive, so research on efficient fine-tuning for low-resource language models like Tibetan is highly necessary; our work can serve as a reference to fill this crucial gap. We conducted three types of efficient fine-tuning experiments on the publicly available TNCC-title dataset: prompt-tuning, Adapter lightweight fine-tuning, and prompt-tuning + Adapter fine-tuning. The experimental results demonstrate significant improvements using these methods, providing valuable insights for advancing Tibetan language applications in the context of pre-trained models.
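A back-of-the-envelope calculation shows why parameter-efficient methods like adapters matter in low-resource settings: only the small bottleneck modules are trained, a tiny fraction of the full model. The model dimensions below are hypothetical (BERT-base-like), not the paper's.

```python
# Count trainable parameters for adapter bottlenecks vs. full fine-tuning.
# Each adapter is a down-projection plus an up-projection, with biases.
def adapter_params(hidden: int, bottleneck: int, layers: int) -> int:
    per_layer = (hidden * bottleneck + bottleneck   # down-projection W + b
                 + bottleneck * hidden + hidden)    # up-projection W + b
    return per_layer * layers

full = 110_000_000                                  # assumed full-model size
adapters = adapter_params(hidden=768, bottleneck=64, layers=12)
print(adapters, f"{adapters / full:.2%}")           # ~1% of full fine-tuning
```

Prompt-tuning is even lighter still: a soft prompt of, say, 20 virtual tokens at hidden size 768 adds only 20 × 768 = 15,360 trainable parameters.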
Tibet dying a 'slow death' under Chinese rule, says exiled leader
Exiled Tibetan leaders and officials in the United States have condemned China's "cruel" policies in Tibet, accusing Beijing of separating families in the Himalayan region, banning their language, and engaging in non-consensual DNA collection. Addressing the US Congress for the first time, Penpa Tsering, the head of the India-based organisation known as Tibet's government in exile, said on Tuesday that Tibet was dying a "slow death" under Chinese rule. "We often get asked why we don't hear about Tibet any more," said Tsering, known as the Sikyong of the Central Tibetan Administration (CTA). He blamed that silence on China's "Orwellian gridlock system, use of all means of artificial intelligence to surveil people, control the flow of information and lockdown of Tibet to the outside world". "Tibetan language, religion and culture are the bedrock of Tibetan identity … These are facing the unprecedented threat of eradication," he told the bipartisan Congressional-Executive Commission on China hearing via video link.
China deploys armed robotic vehicles during standoff with India to deal with cold, difficult terrain: reports
Fox News national security correspondent Jennifer Griffin discusses a report alleging China is developing 'brain control weapons' on 'Fox Report.' Reports from India claim that China has started to deploy armed robotic vehicles to handle the altitude and terrain that has proven too difficult for its troops. China and India clashed in Sept. 2020 during a border dispute along the southern shore of Pangong Lake in an area known in China as Shenpaoshan and in India as Chushul, but the armies continued their standoff along the two nations' borders throughout 2021. China has now reportedly deployed unmanned ground vehicles (UGV) to the region of Tibet to strengthen its position. People's Liberation Army (PLA) soldiers march next to the entrance to the Forbidden City during the opening ceremony of the Chinese People's Political Consultative Conference (CPPCC) in Beijing on May 21, 2020.
China replaces soldiers with machinegun-carrying robots in Tibet
China is deploying machinegun-carrying robots to its western desert regions amid a standoff with India because troops are struggling with the high-altitude conditions, it has been claimed. Dozens of unmanned vehicles capable of carrying both weapons and supplies are being sent to Tibet, Indian media reports, with the majority deployed in border regions where Chinese troops are locked into a standoff with Indian soldiers. Vehicles include the Sharp Claw, which is mounted with a light machinegun and can be operated wirelessly, and the Mule-200, which is designed as an unmanned supply vehicle but can also be fitted with weapons. Beijing has sent 88 Sharp Claws to Tibet, which borders India high in the Himalayas, of which 38 are deployed to the border region, Times News Now has claimed. Some 120 Mule-200s have also been sent to Tibet, News Now reports, with a majority of them deployed to the border area.